Explore City venues (Kolkata)

The city venue data will be used to explore the city.

  • First, the most frequent venue categories across the whole city are extracted.
  • Then the 10 most common venue categories in each neighborhood are extracted, and the neighborhoods are clustered with k-means on the basis of their venue-category frequencies.
In [294]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Initialize a list of the cities we are interested in and select one of them

In [295]:
# initialize cities array and perform exploratory data analysis on the selected city
cities = ['Delhi','Mumbai','Kolkata','Chennai']
city = cities[2]
city_venues = pd.read_csv(city + '_venues.csv',index_col = 0)
city_venues.head()
Out[295]:
Neighborhood Neighborhood Latitude Neighborhood Longitude Venue Venue Latitude Venue Longitude Venue Category Venue Summary Venue Type
0 Kalyani Municipality 22.570539 88.371239 Bhim Chandra Nag 22.570639 88.371524 Indian Sweet Shop This spot is popular general
1 Kalyani Municipality 22.570539 88.371239 Big Bazaar 22.565919 88.369635 Department Store This spot is popular general
2 Kalyani Municipality 22.570539 88.371239 Indian Coffee House 22.576187 88.364013 Café This spot is popular general
3 Kalyani Municipality 22.570539 88.371239 Paramount 22.573874 88.364496 Juice Bar This spot is popular general
4 Kalyani Municipality 22.570539 88.371239 Café Coffee Day 22.565919 88.369635 Café This spot is popular general

Exploratory data analysis

In [296]:
# see number of venues per neighbourhood
city_venues.groupby('Neighborhood').count().head()
Out[296]:
Neighborhood Latitude Neighborhood Longitude Venue Venue Latitude Venue Longitude Venue Category Venue Summary Venue Type
Neighborhood
Baidyabati Municipality 10 10 10 10 10 10 10 10
Bally Municipality 10 10 10 10 10 10 10 10
Bansberia Municipality 10 10 10 10 10 10 10 10
Baranagar Municipality 10 10 10 10 10 10 10 10
Barasat Municipality 10 10 10 10 10 10 10 10

One-hot encoding the dataframe

In [297]:
# one hot encoding
city_onehot = pd.get_dummies(city_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
city_onehot['Neighborhood'] = city_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [city_onehot.columns[-1]] + list(city_onehot.columns[:-1])
city_onehot = city_onehot[fixed_columns]

city_onehot['City'] = city
city_onehot.head()
Out[297]:
Neighborhood ATM Art Gallery Asian Restaurant Bakery Bank Bus Station Business Service Café Campground ... Pharmacy Platform Plaza Pool Restaurant Swiss Restaurant Tea Room Train Station Watch Shop City
0 Kalyani Municipality 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 Kolkata
1 Kalyani Municipality 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 Kolkata
2 Kalyani Municipality 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 Kolkata
3 Kalyani Municipality 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 Kolkata
4 Kalyani Municipality 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 Kolkata

5 rows × 40 columns

Find the total number of venues of each category in the city and arrange the dataframe

In [298]:
# sum the one-hot columns over the whole city (the text column 'Neighborhood' is dropped first)
city_grouped = city_onehot.drop(columns='Neighborhood').groupby('City').sum().reset_index()
city_grouped = city_grouped.transpose()
city_grouped.columns = city_grouped.iloc[0]
city_grouped.drop(city_grouped.index[[0]],inplace=True)
city_grouped.head()
Out[298]:
City              Kolkata
ATM                     9
Art Gallery             1
Asian Restaurant        2
Bakery                  2
Bank                    1
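
As a cross-check on the totals above, the same per-category counts can be read directly from the raw category column without one-hot encoding. The sketch below uses a small made-up frame (`toy_venues` is hypothetical, not the Kolkata CSV) and `Series.value_counts`, which returns the counts already sorted in descending order.

```python
import pandas as pd

# Hypothetical toy data standing in for city_venues
toy_venues = pd.DataFrame({
    'Venue Category': ['ATM', 'Bakery', 'ATM', 'Bank', 'ATM', 'Bakery'],
})

# Count venues per category; the result is sorted from most to least frequent
category_totals = toy_venues['Venue Category'].value_counts()
print(category_totals.head(10))
```

On the real data, `city_venues['Venue Category'].value_counts()` should give the same numbers as the transposed sum, assuming every row carries exactly one category.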

Sorting the data frame on the basis of frequency, taking the top 10 and plotting them.

In [299]:
city_grouped.sort_values([city],ascending=False,inplace=True)
city_grouped.iloc[0:10].plot(kind='bar',figsize=(10,5))
plt.title(city + ' Venue distribution')
plt.ylabel('Total no of venues in the City')
plt.xlabel('Venue Categories')
Out[299]:
Text(0.5, 0, 'Venue Categories')

Now we will find the top 10 most common venues for each neighborhood

In [300]:
# First, let's write a function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
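
To see what the function returns, here is a small usage sketch on a single made-up row (the neighborhood name and frequencies are hypothetical). The first element of the row is skipped because it holds the neighborhood name, and the remaining values are ranked by frequency.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the neighborhood name) and rank the rest
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row: one neighborhood with three category frequencies
row = pd.Series({'Neighborhood': 'Toy Ward', 'ATM': 0.1, 'Bakery': 0.6, 'Bank': 0.3})
top_two = return_most_common_venues(row, 2)
print(top_two)
```

Note that categories with equal frequency (including a frequency of zero) are returned in no meaningful order, so for a neighborhood with fewer than `num_top_venues` distinct categories the lower ranks are padding rather than real venues.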
In [301]:
# drop the text column 'City' so that only the one-hot columns are averaged
city_grouped = city_onehot.drop(columns='City').groupby('Neighborhood').mean().reset_index()
city_grouped.head()
Out[301]:
Neighborhood ATM Art Gallery Asian Restaurant Bakery Bank Bus Station Business Service Café Campground ... Performing Arts Venue Pharmacy Platform Plaza Pool Restaurant Swiss Restaurant Tea Room Train Station Watch Shop
0 Baidyabati Municipality 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 ... 0.0 0.0 0.2 0.1 0.1 0.0 0.0 0.0 0.1 0.0
1 Bally Municipality 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 ... 0.0 0.0 0.2 0.1 0.1 0.0 0.0 0.0 0.1 0.0
2 Bansberia Municipality 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 ... 0.0 0.0 0.2 0.1 0.1 0.0 0.0 0.0 0.1 0.0
3 Baranagar Municipality 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 ... 0.0 0.0 0.2 0.1 0.1 0.0 0.0 0.0 0.1 0.0
4 Barasat Municipality 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 ... 0.0 0.0 0.2 0.1 0.1 0.0 0.0 0.0 0.1 0.0

5 rows × 39 columns
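
The grouped means above are the share of each venue category within a neighborhood, because the mean of a 0/1 indicator column is a proportion. A minimal sketch on made-up data (the frame `toy` and its values are hypothetical) shows the two steps together:

```python
import pandas as pd

# Hypothetical toy data: two neighborhoods, two venues each
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B', 'B'],
    'Venue Category': ['Bakery', 'ATM', 'Bakery', 'Bakery'],
})

# One indicator column per category; dtype=int keeps the values as 0/1
toy_onehot = pd.get_dummies(toy['Venue Category'], dtype=int)
toy_onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# Mean of the indicators = share of each category within a neighborhood
toy_freq = toy_onehot.groupby('Neighborhood').mean()
print(toy_freq)
```

Each row of the result sums to 1, which is what makes the rows comparable between neighborhoods with different numbers of venues.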

In [302]:
# Now let's create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = city_grouped['Neighborhood']

for ind in np.arange(city_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(city_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
Out[302]:
Neighborhood 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue
0 Baidyabati Municipality Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
1 Bally Municipality Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
2 Bansberia Municipality Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
3 Baranagar Municipality Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
4 Barasat Municipality Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station

After finding the top 10 venues in each neighborhood, the neighborhoods will be clustered with the k-means algorithm

In [303]:
# import necessary packages (matplotlib.pyplot is already imported as plt above)
from sklearn.cluster import KMeans

Find optimal k for the clustering process.

In [304]:
city_grouped_clustering = city_grouped.drop(columns='Neighborhood')

sse = {}
for k in range(1, 10):
    # run k-means clustering on the unmodified feature matrix;
    # the labels are not written back, so they cannot leak into the next fit
    kmeans = KMeans(n_clusters=k, random_state=0, max_iter=1000).fit(city_grouped_clustering)
    sse[k] = kmeans.inertia_ # Inertia: sum of squared distances of samples to their closest cluster center
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()

From the graph we find the elbow point and then use it to cluster the neighborhoods.
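
Reading the elbow off a plot is a judgment call. One possible heuristic, shown here only as a sketch and not as the method used in this notebook, is to pick the k at which the SSE curve bends most sharply, measured by the largest second difference. The function name `elbow_by_second_difference` and the SSE values are hypothetical.

```python
import numpy as np

def elbow_by_second_difference(sse):
    """Return the k whose SSE curve bends most sharply (largest second difference)."""
    ks = sorted(sse)
    values = np.array([sse[k] for k in ks], dtype=float)
    # second difference at each interior k: sse[k-1] - 2*sse[k] + sse[k+1]
    second_diff = values[:-2] - 2 * values[1:-1] + values[2:]
    # interior points start at index 1 of ks
    return ks[1 + int(np.argmax(second_diff))]

# Made-up SSE values with a visible bend at k = 3
toy_sse = {1: 100.0, 2: 60.0, 3: 30.0, 4: 25.0, 5: 22.0, 6: 20.0}
best_k = elbow_by_second_difference(toy_sse)
print(best_k)
```

The heuristic can disagree with a visual reading when the curve is noisy, so it is best treated as a suggestion to compare against the plot.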

In [305]:
# set number of clusters
kclusters = 5

city_grouped_clustering = city_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(city_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
Out[305]:
array([1, 1, 1, 1, 1, 1, 1, 1, 3, 1])
In [306]:
df = pd.read_csv(city + '_subdiv.csv',index_col=0)
df.head()
Out[306]:
   Neighborhood              City     Latitude   Longitude
0  Kalyani Municipality      Kolkata  22.570539  88.371239
1  Gayespur Municipality     Kolkata  22.570539  88.371239
2  Kanchrapara Municipality  Kolkata  22.951059  88.431023
3  Halisahar Municipality    Kolkata  22.570539  88.371239
4  Naihati Municipality      Kolkata  22.895760  88.428757

Add clustering labels to the dataframe

In [307]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

city_merged = df

# join the sorted venues with the subdivision data to add latitude/longitude for each neighborhood
city_merged = city_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# drop the rows with NaN values after the join
city_merged.dropna(inplace=True)

city_merged.head() # check the last columns!
Out[307]:
Neighborhood City Latitude Longitude Cluster Labels 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue
0 Kalyani Municipality Kolkata 22.570539 88.371239 1.0 Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
1 Gayespur Municipality Kolkata 22.570539 88.371239 1.0 Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
2 Kanchrapara Municipality Kolkata 22.951059 88.431023 0.0 ATM Chinese Restaurant Farmers Market Electronics Store Department Store Deli / Bodega Convenience Store Coffee Shop Cocktail Bar Campground
3 Halisahar Municipality Kolkata 22.570539 88.371239 1.0 Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
4 Naihati Municipality Kolkata 22.895760 88.428757 0.0 ATM Electronics Store Chinese Restaurant Farmers Market Department Store Deli / Bodega Convenience Store Coffee Shop Cocktail Bar Campground
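
The cluster labels appear as 1.0 and 0.0 above because the join leaves NaN for neighborhoods without venue data, which promotes the column to float. A minimal sketch on made-up frames (`subdiv` and `labels` are hypothetical stand-ins for `df` and `neighborhoods_venues_sorted`) shows how the labels could be cast back to integers once the unmatched rows are dropped:

```python
import pandas as pd

# Hypothetical stand-ins: three neighborhoods, only two of them have labels
subdiv = pd.DataFrame({'Neighborhood': ['A', 'B', 'C'], 'City': ['Toy'] * 3})
labels = pd.DataFrame({'Neighborhood': ['A', 'C'], 'Cluster Labels': [1, 0]})

merged = subdiv.join(labels.set_index('Neighborhood'), on='Neighborhood')
# 'B' has no label, so the column holds NaN and is promoted to float
merged = merged.dropna(subset=['Cluster Labels'])
# with the NaN rows gone, the labels can be stored as integers again
merged['Cluster Labels'] = merged['Cluster Labels'].astype(int)
print(merged)
```

Integer labels can then be used directly as list indices when colors are assigned to clusters.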

After clustering, the data will be visualized with folium.

In [308]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium

Create a function to get the latitude and longitude of the city center using the geocoder library

In [309]:
# use the geocoder library; if it is not present, install it with !conda install -c conda-forge geocoder
import geocoder
# An API key is required for the geocoder library to work. Save the key in the OS environment
# variables as BING_API_KEY and read it here instead of hard-coding it in the notebook.
import os
BING_API_KEY = os.environ['BING_API_KEY']

# This function takes an address and returns the latitude and longitude of that address
def get_latlng(address):
    # using the Bing geocoder API
    g = geocoder.bing(address, key=BING_API_KEY)
    return pd.Series(g.latlng)
In [310]:
# get latitude and longitude of city to center the map
latitude, longitude = get_latlng(city)
print('Lat : ',latitude,' Long : ',longitude)
Lat :  22.570539474487305  Long :  88.3712387084961

Visualize the neighborhoods before clustering

In [311]:
# Function takes a data frame with Latitude, Longitude, Neighborhood and City columns
# and adds one marker per row to the given folium map
def visualize_area_in_map(data, base_map):
    # add markers to map
    for lat, lng, neighborhood, city_name in zip(data['Latitude'], data['Longitude'], data['Neighborhood'], data['City']):
        label = '{}, {}'.format(neighborhood, city_name)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=2,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7).add_to(base_map)

    return base_map
In [312]:
# create map of the city using latitude and longitude values
# (named city_map so that the built-in map() is not shadowed)
city_map = folium.Map(location=[latitude, longitude], zoom_start=10)

# data to be used for map
data = df.dropna()

visualize_area_in_map(data, city_map)
Out[312]:

Visualize the clusters

In [313]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters: one color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(city_merged['Latitude'], city_merged['Longitude'],
                                  city_merged['Neighborhood'], city_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(int(cluster)), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=4,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.9).add_to(map_clusters)
       
map_clusters
Out[313]:

Exploring the clusters

In [314]:
# show the city and top venue columns for the neighborhoods in cluster 0
city_merged.loc[city_merged['Cluster Labels'] == 0, city_merged.columns[[1] + list(range(5, city_merged.shape[1]))]]
Out[314]:
City 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue
2 Kolkata ATM Chinese Restaurant Farmers Market Electronics Store Department Store Deli / Bodega Convenience Store Coffee Shop Cocktail Bar Campground
4 Kolkata ATM Electronics Store Chinese Restaurant Farmers Market Department Store Deli / Bodega Convenience Store Coffee Shop Cocktail Bar Campground
6 Kolkata ATM Performing Arts Venue Train Station Asian Restaurant Cocktail Bar Farmers Market Electronics Store Department Store Deli / Bodega Convenience Store
14 Kolkata ATM IT Services Bakery Bus Station Cocktail Bar Electronics Store Department Store Deli / Bodega Convenience Store Coffee Shop
15 Kolkata ATM Restaurant Bank Pharmacy Chinese Restaurant Electronics Store Department Store Deli / Bodega Convenience Store Coffee Shop
17 Kolkata ATM Bakery Market Bus Station Park Train Station Bank Asian Restaurant Business Service Café
37 Kolkata ATM Train Station Chinese Restaurant Farmers Market Electronics Store Department Store Deli / Bodega Convenience Store Coffee Shop Cocktail Bar
In [315]:
city_merged.loc[city_merged['Cluster Labels'] == 1, city_merged.columns[[1] + list(range(5, city_merged.shape[1]))]]
Out[315]:
City 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue
0 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
1 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
3 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
7 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
8 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
9 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
11 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
13 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
16 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
18 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
19 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
20 Kolkata Watch Shop Multiplex Asian Restaurant Café Cocktail Bar Deli / Bodega Fast Food Restaurant Food Court Tea Room Swiss Restaurant
21 Kolkata Farmers Market Asian Restaurant Hotel Indian Restaurant Business Service Cocktail Bar Electronics Store Department Store Deli / Bodega Convenience Store
22 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
25 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
26 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
27 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
28 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
29 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
30 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
32 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
33 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
34 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
38 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
39 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
40 Kolkata Café Platform Plaza Train Station Indian Sweet Shop Juice Bar Department Store Pool Bank Bus Station
41 Kolkata Train Station Restaurant Coffee Shop Café Chinese Restaurant Farmers Market Electronics Store Department Store Deli / Bodega Convenience Store
42 Kolkata Chinese Restaurant Hostel Hotel Bus Station Campground Market Cocktail Bar Department Store Deli / Bodega Convenience Store
In [316]:
city_merged.loc[city_merged['Cluster Labels'] == 2, city_merged.columns[[1] + list(range(5, city_merged.shape[1]))]]
Out[316]:
City 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue
10 Kolkata Train Station Watch Shop Chinese Restaurant Farmers Market Electronics Store Department Store Deli / Bodega Convenience Store Coffee Shop Cocktail Bar
23 Kolkata Convenience Store Train Station Watch Shop Chinese Restaurant Farmers Market Electronics Store Department Store Deli / Bodega Coffee Shop Cocktail Bar
In [317]:
city_merged.loc[city_merged['Cluster Labels'] == 3, city_merged.columns[[1] + list(range(5, city_merged.shape[1]))]]
Out[317]:
City 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue
5 Kolkata Platform Watch Shop Campground Electronics Store Department Store Deli / Bodega Convenience Store Coffee Shop Cocktail Bar Chinese Restaurant
In [318]:
city_merged.loc[city_merged['Cluster Labels'] == 4, city_merged.columns[[1] + list(range(5, city_merged.shape[1]))]]
Out[318]:
City 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue
12 Kolkata Art Gallery Indian Restaurant Pharmacy Watch Shop Chinese Restaurant Electronics Store Department Store Deli / Bodega Convenience Store Coffee Shop
35 Kolkata Pharmacy Train Station Indian Sweet Shop Chinese Restaurant Electronics Store Department Store Deli / Bodega Convenience Store Coffee Shop Cocktail Bar
36 Kolkata Pharmacy ATM Train Station Asian Restaurant Cocktail Bar Farmers Market Electronics Store Department Store Deli / Bodega Convenience Store
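
The five cells above repeat the same selection with a different label each time. As a sketch of an alternative, `groupby` can split the merged frame into one table per cluster in a single pass. The frame `toy_merged` below is made up for illustration; on the real data `city_merged` would take its place.

```python
import pandas as pd

# Hypothetical stand-in for city_merged
toy_merged = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'Cluster Labels': [0, 1, 0],
    '1st Most Common Venue': ['ATM', 'Bakery', 'ATM'],
})

# One sub-table per cluster label, keyed by the label
cluster_tables = {
    label: group.drop(columns='Cluster Labels')
    for label, group in toy_merged.groupby('Cluster Labels')
}

# Print the size of each cluster as a quick overview
for label, table in cluster_tables.items():
    print('Cluster', label, '-', len(table), 'neighborhoods')
```

Printing the cluster sizes first makes very small clusters, such as those with one or two neighborhoods, easy to spot before looking at the venue columns.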